The nice thing about reproducible data analysis (as I try to do it here on my blog) is, well, that you can quickly reproduce or even replicate an analysis.
So, in this blog post/notebook, I transfer the analysis of "Developers' Habits (IntelliJ Edition)" to another project: the famous open-source operating system Linux. Again, we want to take a look at how much information you can extract from a simple Git log output. This time we want to know in which time zones the developers live, on which weekdays and at which hours they work, and whether there were longer periods of overtime.
Because we use an open approach for our analysis, we are able to respond to newly created insights. Again, we use Pandas as our data analysis toolkit to accomplish these tasks and execute our code in a Jupyter notebook (you can find the original notebook on GitHub). We also refactor the previous analysis a little by leveraging Pandas' date functionality a bit more.
So let's start!
I've already described in my previous blog post how to get the necessary data. What we have at hand is a nice file with the following contents:
1514531161 -0800 Linus Torvalds torvalds@linux-foundation.org
1514489303 -0500 David S. Miller davem@davemloft.net
1514487644 -0800 Tom Herbert tom@quantonium.net
1514487643 -0800 Tom Herbert tom@quantonium.net
1514482693 -0500 Willem de Bruijn willemb@google.com
...
Each entry consists of the UNIX timestamp (in seconds since epoch), a whitespace, the time zone the author lives in, a tab, the name of the author, another tab, and the email address of the author. The whole log covers 13 years of Linux development as available in the GitHub mirror repository.
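As a reminder, a log file like this can be produced with a Git one-liner along the following lines (a sketch from memory; the exact command in the previous post may differ slightly):

git log --pretty="%ad%x09%aN%x09%aE" --date=raw > git_timestamp_author_email.log

Here, --date=raw prints the author date as a UNIX timestamp plus time zone, and %x09 inserts the tab separators.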
We import the data by using Pandas' read_csv function with the appropriate parameters and copy only the needed data from the raw dataset into the new DataFrame git_authors.
In [1]:
import pandas as pd
raw = pd.read_csv(
    r'../../linux/git_timestamp_author_email.log',
    sep="\t",
    encoding="latin-1",
    header=None,
    names=['unix_timestamp', 'author', 'email'])
# create separate columns for time data
raw[['timestamp', 'timezone']] = raw['unix_timestamp'].str.split(" ", expand=True)
# convert timestamp data
raw['timestamp'] = pd.to_datetime(raw['timestamp'], unit="s")
# add hourly offset data
raw['timezone_offset'] = pd.to_numeric(raw['timezone']) / 100.0
# calculate the local time
raw["timestamp_local"] = raw['timestamp'] + pd.to_timedelta(raw['timezone_offset'], unit='h')
# filter out wrong timestamps: keep only commits made between the
# repository's first commit and today
raw = raw[
    (raw['timestamp'] >= raw.iloc[-1]['timestamp']) &
    (raw['timestamp'] <= pd.to_datetime('today'))]
git_authors = raw[['timestamp_local', 'timezone', 'author']].copy()
git_authors.head()
Out[1]:
First, we add the information about the weekdays based on the weekday_name property of the timestamp_local column. Because we want to preserve the order of the weekdays, we also convert the weekday entries to a Categorical data type. The order of the weekdays is taken from the calendar module.
Note: We can do this so easily only because we have such a large amount of data that every weekday actually occurs in it. If we couldn't be sure to have a continuous sequence of weekdays, we would have to fill in the missing weekdays, e.g., with pd.Grouper (see the sketch below for one alternative).
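To make this concrete, here is a minimal sketch of such a fallback. It uses reindex instead of pd.Grouper and assumes only the git_authors DataFrame from above:

import calendar
# hypothetical sketch: count commits per weekday and reindex against all
# weekday names so that missing weekdays show up as 0 instead of
# disappearing silently
weekdays = git_authors['timestamp_local'].dt.weekday_name
commits_per_weekday = weekdays.value_counts().reindex(
    list(calendar.day_name), fill_value=0)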
In [2]:
import calendar
# note: in newer versions of Pandas, this is .dt.day_name()
git_authors['weekday'] = git_authors["timestamp_local"].dt.weekday_name
git_authors['weekday'] = pd.Categorical(
    git_authors['weekday'],
    categories=calendar.day_name,
    ordered=True)
git_authors.head()
Out[2]:
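Next, we extract the hour of the day from the timestamp_local column so that we can analyze the developers' working hours later on.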
In [3]:
git_authors['hour'] = git_authors['timestamp_local'].dt.hour
git_authors.head()
Out[3]:
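Now we can answer our first question: In which time zones do the developers live? We simply count the commits per time zone and plot the shares as a pie chart.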
In [4]:
%matplotlib inline
timezones = git_authors['timezone'].value_counts()
timezones.plot(
    kind='pie',
    figsize=(7, 7),
    title="Developers' timezones",
    label="")
Out[4]:
Result
The majority of the developers' commits come from the time zones +0100, +0200 and -0700. With -0700 probably corresponding to the West Coast of the USA, this might just be an indicator that Linus Torvalds lives there ;-) . But there are also many commits from developers in Western Europe.
In [5]:
ax = git_authors['weekday'].\
    value_counts(sort=False).\
    plot(
        kind='bar',
        title="Commits per weekday")
ax.set_xlabel('weekday')
ax.set_ylabel('# commits')
Out[5]:
Result
Most of the commits occur during normal working days with a slight peak on Wednesday. There are relatively few commits happening on weekends.
It would be very interesting and easy to see when Linus Torvalds (the main contributor to Linux) is working. But we won't do that, because the as-yet-unwritten codex of Software Analytics tells us that it's not OK to analyze a single person's behavior – especially when such an analysis is based on an uncleaned dataset like the one we have here.
In [6]:
ax = git_authors\
    .groupby(['hour'])['author']\
    .count().plot(kind='bar')
ax.set_title("Distribution of working hours")
ax.yaxis.set_label_text("# commits")
ax.xaxis.set_label_text("hour")
Out[6]:
Result
The distribution of the working hours is interesting.
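Next, we want to know if there were any stressful time periods that forced the developers to work overtime over a longer period of time. As a first step, we determine the latest hour at which each author committed in each week: we group by calendar week (via pd.Grouper with a weekly frequency on the timestamp_local column) and by author, and take the maximum of the hour column.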
In [7]:
latest_hour_per_week = git_authors.groupby(
    [
        pd.Grouper(key='timestamp_local', freq='1w'),
        'author'
    ]
)[['hour']].max()
latest_hour_per_week.head()
Out[7]:
To see if there were longer periods of overtime, we then calculate the mean of these late stays over all authors for each week.
In [8]:
# select only the hour column explicitly; newer Pandas versions would
# otherwise complain about averaging the non-numeric author column
mean_latest_hours_per_week = \
    latest_hour_per_week \
    .reset_index().groupby('timestamp_local')[['hour']].mean()
mean_latest_hours_per_week.head()
Out[8]:
We also create a trend line that shows how the contributors have been working over the span of the past years. We use the polyfit function from numpy for this, which needs a numeric index to calculate the polynomial coefficients. We fit a polynomial of degree 3 to the hour values of the mean_latest_hours_per_week DataFrame. For visualization, we then build a polynomial from the coefficients with poly1d and evaluate it for all weeks encoded in numeric_index. We store the result in the mean_latest_hours_per_week DataFrame.
In [9]:
import numpy as np
numeric_index = range(0, len(mean_latest_hours_per_week))
coefficients = np.polyfit(numeric_index, mean_latest_hours_per_week.hour, 3)
polynomial = np.poly1d(coefficients)
ys = polynomial(numeric_index)
mean_latest_hours_per_week['trend'] = ys
mean_latest_hours_per_week.head()
Out[9]:
Finally, we plot the hour values of the mean_latest_hours_per_week DataFrame as well as the trend data in one line plot.
In [10]:
ax = mean_latest_hours_per_week[['hour', 'trend']].plot(
    figsize=(10, 6),
    color=['grey', 'blue'],
    title="Late hours per week")
ax.set_xlabel("time")
ax.set_ylabel("hour")
Out[10]:
Result
We see no sign of significant overtime periods over 13 years of Linux development. Shortly after the creation of the Git mirror repository, there might have been a time with some irregularities. But overall, there are no signs of death marches. It seems that the Linux development team has established a stable development process.
Again, we've seen that various metrics and results can easily be created from a simple Git log output file. With Pandas, it's possible to get to know the habits of a software project's developers. Thanks to Jupyter's open notebook approach, we can easily adapt an existing analysis and add situation-specific information to it as we go along.